Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning

نویسندگان

  • Vasileios Hatzlvassiloglou
  • Judith L. Klavans
  • Eleazar Eskin
چکیده

We present a new composite similarity metric that combines information from multiple linguistic indicators to measure semantic distance between pairs of small textual units. Several potential features are investigated and an optireal combination is selected via machine learning. We discuss a more restrictive definition of similarity than traditional, document-level and information retrieval-oriented, notions of similarity, and motivate it by showing its relevance to the multi-document text summarization problem. Results from our system are evalua ted against standard information retrieval techniques, establishing that the new method is more effective in identifying closely related textual units. 1 R e s e a r c h G o a l s In this paper, we focus on the problem of detecting whether two small textual units (paragraphor sentence-sized) contain common information, as a necessary step towards extracting such common information and constructing thematic groups of text units across multiple documents. Identifying similar pieces of text has many applications (e.g., summarization, information retrieval, text clustering). Most research in this area has centered on detecting similarity between documents [Willet 1988], similarity between a query and a document [Salton 1989] or between a query and a segment of a document [Callan 1994]. While effective techniques have been developed for document clustering and classification which depend on inter-document similarity measures, these techniques mostly rely on shared words, or occasionally collocations of words [Smeaton 1992]. When larger units of text are compared, overlap may be sufficient to detect similarity; but when the units of text are small, simple surface matching of words and phrases is less likely to succeed since the number of potential matches is smaller. Our task differs from typical text matching applications not only in the smaller size of the text units compared, but also in its overall goal. Our notion of similarity is more restrictive than topical similarity--we provide a detailed definition in the next section. We aim to recover sets of small textual units from a collection of documents so that each text unit within a given set describes the same action. Our system, which is fully implemented, is further motivated by the need for determining similarity between small pieces of text across documents that potentially span different topics during multi-document summarization. It serves as the first component of a domain-independent multidocument summarization system [McKeown et al. 1999] which generates a summary through text reformulation [Barzilay et al. 1999] by combining information from these similar text passages. We address concerns of sparse data and the narrower than topical definition of similarity by exploring several linguistic features, in addition to shared words or collocations, as indicators of text similarity. Our pr imi t ive features include linked noun phrases, WordNet synonyms, and semantically similar verbs. We also define composite features over pairs of primitive features. We then provide an effective method for aggregating the feature values into a similarity measure using machine learning, and present results

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Novel Architecture for Detecting Phishing Webpages using Cost-based Feature Selection

Phishing is one of the luring techniques used to exploit personal information. A phishing webpage detection system (PWDS) extracts features to determine whether it is a phishing webpage or not. Selecting appropriate features improves the performance of PWDS. Performance criteria are detection accuracy and system response time. The major time consumed by PWDS arises from feature extraction that ...

متن کامل

Using Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media

Social media allows people interact to express their thoughts or feelings about different subjects. However, some of users may write offensive twits to other via social media which known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text "twits" via one of the machine l...

متن کامل

Ontology-guided feature engineering for clinical text classification

In this study we present novel feature engineering techniques that leverage the biomedical domain knowledge encoded in the Unified Medical Language System (UMLS) to improve machine-learning based clinical text classification. Critical steps in clinical text classification include identification of features and passages relevant to the classification task, and representation of clinical text to ...

متن کامل

A Paraphrase and Semantic Similarity Detection System for User Generated Short-Text Content on Microblogs

Existing systems deliver high accuracy and F1-scores for detecting paraphrase and semantic similarity on traditional clean-text corpus. For instance, on the clean-text Microsoft Paraphrase benchmark database, the existing systems attain an accuracy as high as 0.8596. However, existing systems for detecting paraphrases and semantic similarity on user-generated short-text content on microblogs su...

متن کامل

Discovering Links Between Lexical and Surface Features in Questions and Answers

Information retrieval systems, based on keyword match, are evolving to question answering systems that return short passages or direct answers to questions, rather than URLs pointing to whole pages. Most open-domain question answering systems depend on elaborately designed hierarchies of question types. A question is first classified to a fixed type, and then hand-engineered rules associated wi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1999